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A A-BODY SOLVER FOR SQUARE ROOT ITERATION 

MATT CHALLACOMBE*t TERRY HAUT*AND NICOLAS BOCK* 


Abstract. We develop the Sparse Approximate Matrix Multiply (SpAMM) n-body solver for 
first order Newton Schulz iteration of the matrix square root and inverse square root. The solver 
performs recursive two-sided metric queries on a modified Cauchy-Schwarz criterion, culling negligible 
sub-volumes of the product-tensor for problems with structured decay in the sub-space metric. These 
sub-structures are shown to bound the relative error in the matrix-matrix product, and in favorable 
cases, to enjoy a reduced computational complexity governed by dimensionality reduction of the 
product volume. A main contribution is demonstration of a new, algebraic locality that develops under 
contractive identity iteration, with collapse of the metric-subspace onto the identity’s plane diagonal, 
resulting in a stronger SpAMM bound. Also, we carry out a first order Frechet analyses for single and 
dual channel instances of the square root iteration, and look at bifurcations due to ill-conditioning 
and a too aggressive SpAMM approximation. Then, we show that extreme SpAMM approximation and 
contractive identity iteration can be achieved for ill-conditioned systems through regularization, and 
we demonstrate the potential for acceleration with a scoping, product representation of the inverse 
factor. 


1. Introduction. In many areas of current numerical interest, matrix equations 
with decay properties describe correlations over a range of scales. By decay, we mean 
an approximate inverse relationship between a matrix element’s magnitude and an 
associated distance; this might be a slow inverse exponential relationship between 
matrix elements and a Cartesian separation, or it might involve a non-Euclidean 
distance, e.g. between character strings. 

A common approach to exploiting matrix decay involves sparse approximation 
of inverse factors that transform Gramian equations to a representation independent 
form, via congruence transformations based on Lowdin’s symmetric orthogonalization 
(the matrix inverse square root) [85, 91], inverse Cholesky factorization [75] or related 
transformations that involve an inverse or pseudo-inverse [61, 18, 55, 56]. Gramian 
inverse factors with decay are ubiquitous to problems with local, non-orthogonal 
support, including finite element calculations [40, 59], radial-basis-function finite- 
difference calculations [114, 110], in the “direct” approach to radial-basis interpolation 
[106], with frames [47, 63], with computation involving “lets” of various types [55, 56], 
and in the Atomic Orbital (AO) representation [71, 64]. 

Off-diagonal decay of the matrix sign function is also a well developed area of 
study in statistics and statistical physics [98, 117, 6, 60, 76], and in electronic structure, 
where sparse approximation enables fast computation of the the gap shifted matrix 
sign function as projector of an effective Hamiltonian [17, 13, 22, 14]. Short to long 
ranged decay properties of the projector are shown in Fig. 1.1. These matrix functions, 
the matrix sign function and the matrix inverse square root, are related by Higham’s 
identity [66]: 


sign 


( 

'0 s' 


1 

o 

)■[ 


0 

- 1/2 


^ 1 / 2 - 

0 


( 1 . 1 ) 


A well conditioned matrix s may often correspond to matrix sign and inverse 
square root functions with rapid exponential decay, and be amenable to ad hoc 
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Fig. 1.1. Examples from electronic structure of decay for the spectral projector (gap shifted sign 
function) with respect to the local (atomic) support. Shown is decay for systems with correlations 
that are short (insulating water), medium (semi-conducting 4^3 nanotube), and long (metallic 3,3 
nanotube) ranged, from exponential (insulating) to algebraic (metallic). 


matrix truncation or “sparsification”, s = s + e®, where e® is the error introduced 
according to some criterion r, supported by useful bounds to matrix function elements 
[15, 16, 97, 59, 26]. The criterion r might be a drop-tolerance, e® = {—| \sij\ < r}, 
a radial cutoff, e® = {—Sij * | \\ri — rj\\ > r}, or some other approach to truncation, 

perhaps involving a sparsity pattern chosen a priori for computational expedience. 
Then, the sparse general matrix-matrix multiply (SpGEMM) [58, 113, 28, 21] may be 
employed, yielding fast solutions for multiplication rich iterations, and with fill-in 
modulated by truncation. Exhaustive surveys of these methods in the numerical linear 
algebra are given by Benzi [17, 13], and by Bowler [22] and Benzi [14] for electronic 
structure. 

In addition to sparsity, data localities leading to high operation counts are essential 
for kernels like the SpGEMM and their distributed implementations. Over the past 
decades, methods have evolved from bandwidth reduction (Cuthill-McKee) -h greedy 
blocking [113], progressing with tours of the graph via heuristic solutions to the 
Traveling Salesman Problem (TSP) [100, 3, 81], and more recently towards reordering 
based on cache modeling and dynamic sampling [48, 99]. Ordering with graph 
partitioning, targeting the load balance, may also lead to exploitable localities, via 
e.g. proximity to the diagonal [24]. Of current interest are ordering schemes that 
enhance the weighted block-locality of the Page Rank problem [73, 39, 82, 128]. 

Matrix locality may also result from an ordering that preserves locality in an 
auxiliary representation, a property of sub-space mappings that preserve local neigh¬ 
borhoods [10, 11, 12]. In the case of electronic structure. Space Filling Curve (SEC) 
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heuristics applied to a local Cartesian basis results in Gramian matrices with neighbor¬ 
hoods segregated by magnitude [28, 23], as shown in Fig. (1.2). Likewise, Sierpinski 
curves and Self Avoiding Walks on meshes lead to locality preserving orderings [62, 7], 
for e.g. finite elements [93, 108]. This type of weighted block-locality or “Block-By- 
Magnitude” (BBM) structure of the subspace metric \\'\\f is finely resolved with the 
quadtree matrix [125, 1, 101, 124, 126, 84]; 


^00 

^*+1' 
^01 

^10 

^11 J 


( 1 . 2 ) 


where i is the recursion depth, and 

llaik = \J\KV\\% + \KV\\% + ll«S'lll + ll«Tlll , (1-3) 

is the sub-multiplicative Frobenius norm [51, 65, 72]. 




Fig. 1.2. At left, Block-By-Magnitude (BBM) structure of a quantum chemical Gramian 
(overlap matrix), for a box of 100 water molecules, with Cartesian support along a locality preserving 
curve. At right, quadtree resolution of neighborhoods with norms down to 10 “^. 

However, despite structuring for cache, distributed memory or to enhance BBM 
structuring, matrix truncation may still be ineffective for ill-conditioned problems, 
because the rate of decay may be too slow, and also because of increased numerical 
sensitivities to the sparse approximation: 

a b = a b + e^ b + a ■ + 0{r‘^), (1.4) 

allowing to control only absolute errors. An alternative approach is to find a reduced 
rank approximation, ideally closed under the operations of interest. However, rank 
reduction may be expensive if the rank is not much, much smaller than the dimension. 
Interestingly, in the ultra-fiat limit, kernel methods enjoy rank reduction corresponding 
formally to change of basis, enabling fast methods for constructing the generalized 
inverse [107, 31]. In cases with simply slow exponential decay however, our experience 
has so far been that naive element dropping is about as effective as dropping singular 
values. 






































































































































































































































































































































































































































































































4 


Challacombe, Haut & Bock 


In this contribution, we consider the regime between trivial sparsity and formal 
rank reduction, with fast multiplications exploiting instead an accelerated volumetric 
decay in subspace-metric of the product-tensor. There, the SpAMM kernel carries out 
octree scoping of low-dimensional structures that bound the relative error while yielding 
a reduced complexity multiplication. Beyond decay associated with the the matrix 
square root and its inverse, we demonstrate additional compression of these bounding 
volumes under contractive identity iteration. 

This paper is organized as follows: In Section 2, we modify the SpAMM occlusion-cull 
to bound the relative product error. In Section 3, we review several instances of the 
first order Newton-Schulz (NS) square root iteration, and go over the contractive 
identity iteration that develops in the basin of stability. In Section 4, we overview 
a generic implementation of the SpAMM kernel and introduce quantum chemical and 
engineering data of interest. In Section 5, we develop a Frechet analysis for NS 
instances and the SpAMM algebra, and examine error flows in bifurcating and stable 
square root iterations for ill-conditioned problems. In Section 6 we show that even 
difficult, ill-conditioned problems can be brought to the regime of strongly contractive 
identity iteration, through iterative regularization and precision scoping. In Section 
7, we show for the first time the process of lensing, involving sub-space contraction 
to diagonal planes of the ijk-cnhe {i = i = k and/or j = k)^ followed finally by 
compression onto the identity’s plane diagonal, yielding additional orders of magnitude 
compression of SpAMM sub-volumes. Finally, in Section 8 we argue it may be possible to 
remain close to the lensed state whilst constructing a deferred product representation 
of the inverse factor. 

2. SpAMM. The Sparse Approximate Matrix Multiply (SpAMM) is a reduced com¬ 
plexity approximation that evolved from a row-column skipout mechanism within the 
blocked-compressed-sparse-row (BCSR) [27] and the distributed-blocked-compressed- 
row (DBCSR) data structures [29], to methods with fast subspace resolution through 
octree recursion [32, 19, 20]. Finding sub-spaces via fast range or metric query is 
a generic n-body problem handled with agility by the quadtree [77, 50, 102, 52], a 
problem related to spatial hashing [112, 80] and the occlusion-cull in visualization [96]. 

The SpAMM kernel 0 ^- provides fast approximate multiplication for matrices with 
decay and metric locality, with errors controlled by the scoping parameter r: 

a ■ b = a^rb = a ■ b . (2.1) 

As r ^ 0, SpAMM reverts to the recursive GEMM [57, 46]. 

In this work, we promote the following stable version of the SpAMM occlusion-cull: 

'0 if ||a^||r||<r||a||||b|| 
a® • 6® if(i = leaf) 

a®®^6® = 

fi®+l 4- fi®+l h®+l-l-«*+l^ h®+11 

^00 ' ^01 5 ^00 ^T^Ol ' ^01 

else 

< CZ-QO ^T^Ol "h ^01 ^T^ll 5 ^00 T CToi 

( 2 . 2 ) 

with II *11 = II*11^7 and the leaf condition determined by the block size, N}j. This scoping 
partitions the product tensor into two sub-spaces: the space of culled leaf-tasks, 
and its complement, the occlusion error A^'^ of avoided multiplications. This occlusion 
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error is bounded by 


iiAyii 

l|a|| l|b|l 


< , 


(2.3) 


as shown in the following section, a result commensurate with the stable, normwise 
multiplication criteria emphasized by Demel, Dumitriu, Holtz and Kleinberg (DDHK) 
[42]. 

2.1. Bound. We now prove Eq. (2.3): 

Proposition 2.1. Let Tab = r\\a\\\\b\\. Then for each ij, 


-b)ii - (a-b)-. < riTab, 


and 


\\a^rb - a • 6|| < Tab- 


Proof. We first show the following technical result: it is possible to choose 
aiij G {0,1} such that 


n 

(^(l<S>rbfj ^ ^ CLil bij 
1=1 


(2.4) 


In addition, if auj = 0, then |a^/| \bij\ < Tab- To show this, we use induction on the 
number /cmax of levels. 

First, if /Cmax = 0, 

0 if ||a|| ||6|| < Tab, 

a • b else. 

Therefore, is of the form (2.4) with either all auj = 0 or all auj = 1. Moreover, 

if aiij = 0, then \aii\ \bij\ < ||a|| ||6|| < Tab- 

Now assume that the claim holds for kmax — 1- We show that it holds for kmax- 
Indeed, if ||a|| ||6|| < Tab, we have that = 0, which is of the form (2.4) with all 

aiij = 0. Also, if aiij = 0, then |a^/| \bij\ < ||a|| ||6|| < Tab- 
Now assume that ||a|| ||6|| > Tab- Then 

, f O^00G)r^00 + Ci01<S>Tbl0 CLOO^rboi + aoi0r^ll 

Oj<^q-0 - I L I L U. \ U. 

\ 0tl0<^rO00 T CLll^r^lO CLiq^^Oio + ^110^-011 

We need to consider four cases: i < n/2 and j < n/2, i > n/2 and j > n/2, i > n/2 
and f < n/2, and, finally, i> nj2 and f > n/2. Since the analysis is similar for all 
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four cases, we only consider i <nl2 and j < n/2. We have that 


(a<S)rb)-j = (aoo0r^OO + CLoi<S>rblo)^j 
nl2 

= y] {aoo)ii {boo)ij a%j 
1=1 

nl2 

+ X] (“Ol)j; (f>lo)ij ajij 
1=1 

n 

— ^ -) 

1=1 


where we used the induction hypothesis in the second equality. 

Now suppose that auj = 0 for some 1. Then a^-j = 0 if / < n/2 or ij = 0 


if I > n/2. If, e.g., = 0, then \aii\ \bij\ = (aoi )^^^_^/2 (&io)z_n/ 2 j 

where we used the induction hypothesis in the final inequality. The analysis for I 
is similar, and the claim follows. 

We can now finish the proof of Proposition 2.1. Indeed, by (2.4), 


< n/2 


{a^Tb)ij - (a • 


b)ij <'^\aiibij\\auj - 1\ 


1 = 1 


= ^ \aiibij\. 

CXI 'ij — 0 


In addition, if auj = 0, then \aiibij\ < Tab and the lemma follows. 

□ 

2.2. Related research. SpAMM is perhaps most closely related to the Strassen- 
like branch of fast matrix multiplication [111, 41, 9, 79, 5], and also methods for group 
theoretical embedding allowing fast polynomial multiplication [38, 37, 115]. In the 
Strassen-like approach, disjoint volumes in high order tensor expansions of the product 
are recursively excluded, while in the SpAMM approach to fast multiplication, the 
subspace metric of the product tensor is recursively queried for occlusion of negligible 
volumes, with error bounded by Eq. (2.3). These methods for fast matrix multiplication 
are stable, satisfying the DDHK normwise product bound [41]. 

This work offers a data local alternative to fast non-deterministic methods for 
sampling the product, which include sketching [105, 44, 86, 94, 127], joining [90, 68, 70, 
34, 4, 83, 74], sensing [69] and probing [36]. These methods may involve probabilistic 
assumptions and on the fly sampling, with the potential for complexity reduction 
due to statistical approximations. SpAMM also employs on the fly weighted sampling, 
with compression through octree scoping of metric tensor decay, and with additional 
subspace compression due to the onset of identity iteration. 

SpAMM is related to the generalized n-body methods popularized by Gray [53, 54]. 
Here and in related research, we are interested in generic approaches to approximation 
that are data agnostic, based on the quadtree and its generalizations [77, 50, 102, 52] 
and and on the facile measure H-H = \\'\\f [72]. In this work, a fast two-sided metric 
query enables octree scoping with the occlusion criteria ||a*||||6*|| < r||a||||5||. With 
quantum chemical Fock exchange, a fast three-sided metric query enables hextree 
scoping with a related, Cauchy-Schwarz like occlusion criterion (direct SCF) [30]. It 
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may also be possible to exploit subspace locality more broadly, though mappings that 
optimally preserve local neighborhoods in higher dimensions, e.g. via the Laplace- 
Beltrami operator [10, 11, 12]. 

For distributed architectures, n-body methods offer well established protocols 
for turning spatial locality into data and temporal locality [119, 120, 121, 122, 118]. 
Recently, we showed strong scaling for the SpAMM kernel [20], while Driscoll et. al were 
able to show perfect strong scaling and communication optimally for pairwise n-body 
methods [45]. A uniform approach to generic scoping is empowered at the ecosystems 
level by runtime support for recursive task parallelism [43, 103, 78, 88, 104, 89, 87]. 

Finally, this work is inspired broadly by Higham’s work, particularly by Higham, 
Mackey, Mackey and Tisseur (HMMT) in 2005 [67] on square root iteration and the 
group structure of matrix functions. Also, it is inffuenced by Chen and Chow’s [33] 
approach to scaled NS iteration for ill-conditioned problems, and by the Helgaker 
group’s work on NS iteration, whose notation we follow in part [71]. 

3. Newton-Shulz Iterations. There are two common, first order NS iterations; 
the sign iteration and the square root iteration, related by the square / (•) = sign^ (•). 
These equivalent iterations converge linearly at first, then enter a basin of stability 
marked by super-linear convergence. 

3.1. Sign iteration. For the NS sign iteration, this basin is marked by a behav¬ 
ioral change in the difference SXf^ = Xj^ — Xj^ = sign {Xf^_i + SXi^_i) — sign (X/c_i), 
where SXk-i is some previous error. The change in behavior is associated with the 
onset of idempotence and the bounded eigenvalues of sign' (•), leading to stable itera¬ 
tion when sign' (Xk-i) 6Xk-i < 1. Global perturbative bounds on this iteration have 
been derived by Bai and Demmel [8], while Byers, He and Mehrmann [25] developed 
asymptotic bounds. The automatic stability of sign iteration is a well developed theme 
in Higham’s work [66]. 

3.2. Square root iteration. We are concerned with resolution of the identity 

/(s) = s1/2.^-1/2^ 

and its low-complexity computation with fast methods. 

Starting with eigenvalues rescaled to the domain (0,1] with the easily obtained 
largest eigenvalue, s ^ s/s^-i, and with Zq = I and Xq = = s, the corresponding 

canonical, “dual” channel square root iteration is: 

Vk ^ K [vk-i • ^k-i] • Vk-i 

^k — 1 ’ ha [Vk—l ' ^k — l\ : (^•^) 

converging as yj. Zk —> and Xk —> F, with eigenvalues aggregated 

towards 1 by the NS map ha[x] = ^ {3 — ax) [66, 67]. As in the case of sign 
iteration, this canonical iteration was shown by Higham, Mackey, Mackey and Tisseur 
[67] to remain strongly bounded in the super-linear regime, by idempotent Frechet 
derivatives about the fixed point , in the direction (^Syj^_i,Szk-i)’ 

^Vk = (3.3) 

5zk = \^^k-i - • ^Vk-i ■ • (3.4) 
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In addition to the dual channel instance, we also consider the “single” channel version 
of square root iteration, 


^ ^k—1 ' —l] 

Xk ^ zl • s • Zk . 


(3.5) 


4. Implementation. 

4.1. Programming. In our experimental research, issue driven implementations 
of SpAMM have been developed, including a Haskell version (formal functional pro¬ 
gramming) [35], a fine grained (4 x 4) single-precision assembly coded version 
(scalar performance) [19] and a task parallel version in C++, OpenMP 3.0 and Charm++ 
(strong scaling) [20]. In the current contribution, informal functional programming in 
FortranOS was used, with the goal of generic simplicity and mathematical agility. 

In our implementation, allocation functions instantiate or reuse sub-matrices in 
downward recursion, and accumulate decorations (flops, bounding boxes, non-zeros, 
norms, initialization flags etc.) in backwards recursion, up the stack. Optional, if def’d 
features include the first order Frechet analyses, outlined in Section 5.1 and using 
MATMUL, as well as sparse VTK output for visualization of the ijk product volumes, 
shown in Section 7. 

Precision is determined by the block dimension the primary threshold r 
controlling error in the 2 : and the x channels, and by the tighter (sensitive) threshold 
Ts for the y channel. Unless stated otherwise, we take Ni, = 16 and Tq ^ .01 xr. Finally, 
reported calculations were carried out in double precision using the GCC/gfortran 
4.8.1 compiler. 

4.2. Mapping. The NS logistic map for the square root iteration is ha[x\ = 
^ (3 — ax)^ with the initial rate of convergence controlled by h'^ and the smallest 
eigenvalue, xq. Various schemes for controlling the values a towards convergence 
include methods by Pan and Schreiber [95], and more recently, Jie and Chen [33], who 
demonstrated 2x acceleration for very ill-conditioned problems with their continuous 
scaling approach. 

In addition to scaling of the NS logistic, we introduce a stabilizing map that 
accounts for eigenvalues tossed out of bounds by This stabilization is the transfor¬ 
mation [0,1] ^ [0 + 5 ,1 — 5 ] (shift and scale), carried out prior to application of the 
logistic. 

The most important aspect of these scaling and stabilization maps is to turn them 
off towards convergence. Conventional methods often compute a lowest eigenvalue 
to monitor convergence [95, 33], but this may be too expensive for ill-conditioned 
problems. Alternatively, we monitor convergence simply with the relative trace error, 
tk = {n — tTXk)n~^. Then, sigmoidal functions damp scaling to unity. 



(4.1) 


and the stability parameter to zero. 



(4.2) 


These empirical damping functions are used throughout. 
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4.3. Data. Data for numerical experiments include problems from electronic 
structure and structural engineering. Electronic structure matrices were obtained 
from the non-orthogonal metric (overlap matrix) of the generalized eigenproblem, 
encountered in local support with Gaussian-Type Atomic-Orbitals (GTAOs) [64]. A 
sequence of nanotube matrices, 36x ^ 128x the (3,3) unit cell, were generated with 
TubeGen [49] and a modified STO-2G [109] basis, with an added diffuse (ffat) Gaussian 
sp-shell and exponents Cio^o = .06918 and Ciqh = .05934, corresponding to the 
condition numbers k{s) = 10^^ and k{s) = 10^^ respectively^. We also constructed a 
sequence of matrices from periodic water cubes, in increments of 100, with coordinates 
obtained using the gromacs utility gmx solvate -box [116] and the triple-^" 6-31IG** 
GTAO basis [109]. While less ill-conditioned than the nano-tube sequence, /^(s) ^ 10^, 
the water cube matrices manifest a different metric locality due to dimensionality. 
Also, we experiment with the bcsstkl4 structural engineering matrix for the Roof of 
the Omni Coliseum [123]. 

5. Error Flow. 

5.1. Stability. Stability in the square root iteration is determined by the differ¬ 
ential 


^ ^Vk — l T ^6zk-i ^ ^^k—1 (D{t ) , (^’l) 


which must remain bounded below one to avoid divergence. The corresponding Frechet 
derivatives are 


= lim 

T^O 


X + T(5gfc_i, Zk-i) - Xk 

T 


and 


^6zk-i 


lim 

r^O 


X {Vk-I^^k-I + rSzk-i) - Xk 

T 


(5.2) 


(5.3) 


along unit directions of the previous errors and Szk-i^ by an amount determined 

by the displacements 6yk-i = and 6zk-i = ||. In the single instance, 

we have simply: 


Sxk = Xszk-i X ^Zk-I + 0 {t^) . (5.4) 

This formulation makes plain changes about the resolvent, separating orientational 
effects for derivatives of the unit direction, set mostly by the underlying exact linear 
algebra, from changes to error displacements, which involve both the action of deriva¬ 
tives on previous errors, as well as current SpAMM occlusion errors. In the following 
sections, we develop this form of the error. Then, in Section 5.4, we show interesting 
behaviors of these derivatives at the edge of stability. 

5.2. Frechet derivatives. In the dual instance, Frechet derivatives occurring 
in Eq. (5.1) are: 


= Vk-l • h'JZk-l ■ Vk-I -Zk + Vk- ^Zk-l • ha [Xk-l] 

+ yk-Zk-i-Vk-i'h'aSzk-i, (5.5) 


^In this case, k is double exponential with decreasing 
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and 


[Xk-i] ■ SVk-i ■ Zk + h'JVk.i ■ Zk-I ■ Vk-i ■ Zk 

+ yk-Zk-i-h'JVk-i-Zk-i. (5.6) 

Closer to the fixed point orbit, • Zk-i Vk-i ’ ^ [xk] I and 

h^a ^ Then, 


^Vk-l • {^k - Zk-l) (5.7) 

and 

Xszk-i {Vk - Vk-i) ■ S%-1. (5.8) 

Likewise, in the single channel instance: 

Xzk-I i^k - Zk-lY ■ s ■ SZk-1 + (52fe_i -S'iZk- Zk-l) . (5.9) 

About the fixed point then, error fiow in the y and the 2 : channels is tightly quenched, 
corresponding to x^x^-i ^ identity iteration [67]. 



Fig. 5.1. Trace error and H-H of derivatives and displaeements for the unsealed dual iteration. 
Derivatives are full lines, whilst displacements for Ts = are dashed lines. The trace 

error is a full black line. 


5.3. Displacements. At each step, the accumulation of previous errors in ad¬ 
dition to the SpAMM occlusion error move the approximate iteration away from the 
unperturbed reference, here the double-precision iteration of arrays with MATMUL. 
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Fig. 5.2. Trace error and H-H of derivatives and displaeements for the scaled dual iteration. 
Derivatives are full lines, whilst displacements for Ts = are dashed lines. The trace 

error is a full black line. 


Including the SpAMM error in the Zk-i update we have: 

= ifc_2 ®r ha[Xk_2] = + Zk-2 ' [Xk- 2 ] ■ (5.10) 

Then, with [xk- 2 ] = [xk- 2 ] + h'^^Xk- 2 i and taking z^-i from both sides, 

Szk-l = + SZk-2 ■ ha [Xk- 2 ] + Zk-2 ■ h'a5Xk-2 , (5.11) 

which is bounded by 

5zk-i < \\zk- 2 \\ \\ha [xk- 2 ]\\ + h'Jyk- 2 \\zk- 2 \\) 

+ 5Zk-2 {\\ha [Xk- 2 ] II + ||yfc-2 II) . (5.12) 

In Eq. (5.12), the term h'^5yk-2\\zk-2\f is volatile, tending towards Syk -2 ti{s)/2. 
Because of this sensitivity, and because the y product channel maintains fidelity of 
the starting eigen-basis, we single out this “sensitive” product for a higher level of 
precision; <C r. 

In the single instance, the y channel is implicit in the first product involving s, 
which can be from the left or the right. In this work, the most accurate product in 
the single instance is rightmost. 

5.4. Most approximate yet still stable. The potential to compute fast and 
effective preconditioners with SpAMM is determined by the most approximate yet 
still stable (MAYSS) iteration, a challenge for increasingly ill-conditioned problems. 
Illustrative experiments were carried out on the k,{s) = 10^^ nanotube examples 
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Fig. 5.3. Trace error and H-H of derivatives and displaeements for the sealed single iteration. 
Derivatives are full lines, whilst displaeements for Ts = are dashed lines. The traee 

error is a full black line. 


described in Section 4.3. We picked r = .001 and Nij = 32. Then, we looked at 
stability with respect to the tighter threshold: In Fig. 5.1, unsealed results for the 
dual instance are shown. In Fig. 5.2, scaled results for the dual instance are given, 
and in Fig. 5.3 we show results for the scaled single instance. 

In the dual instances. Figs. 5.1 & 5.2, the bifurcating orientational components 
of the error manage to avoid the numerical catastrophe, with solid green 

converging strongly, and in solid red, with an above unity drift driving divergence 

of the displacements (dashed lines). On the other hand, bifurcation in the single 
instance (r^ = 10“^) finds the orientational component of the error, diverging 

well ahead of the displacement 5zk-i. 

In these problems, values of r near the MAYSS bifurcation do not lead to a reduced 
complexity; instead, near total fill of the product is observed towards convergence, even 
for the largest 128 x unit cell nano-tube. Also, scaling as reported in Section 4 reclaims 
about ^ 2/3 of the available 2x acceleration possible at this level off ill-conditioning 
[33], but dramatically enhances fill-in of the metric tensor, via the multiplicative effect 
of h'^ in Eq. (5.12). In addition to scaling, the single instance also results in a much 
larger volumetric fill-in, involving extended, delocalized error flows in the orbit. 

Our interpretation of these results is that despite a similar overall convergence 
behavior and error control, the tensor volumes accessed by the two instances is very 
different, due to the magnitude of norms entering the SpAMM kernel; in the dual instance 
^duai _ }i\^Xh-i]®Tyk-i behaved , while encumbers large norms 

associated with the broad spectral resolution, leading to extended delocalization of 
the metric tensor. 
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_X /2 

Fig. 5.4. Culled volumes in the thin sliee, single instanee approximation of for the k(s) = 

10^^ nanotuhe sequenee (line width inereasing with system size). With [Xq = .1 it was only possible to 

aehieve stability down to tq = 10“^ Sz Ts = 10“^^. Shown are = (^°^5k-i'S)T-h[5k-i]) ^ 100%/n^ 
(blue) and x 100%/n^ (red). Also shown is the trace error, = {n — trxj^) /n 

(black). 


6. Regularization. Even for the most approximate yet still stable approxima¬ 
tions (MAYSS), our nanotube calculations lead to delocalized products that are not 
tightly bound by Eq. (2.3), even for very large 128x unit cell systems. And while 
similarly ill-conditioned problems may achieve substantial compression with just the 
MAYSS approximation, as shown later in Fig. 7.3, the SpAMM approximation cannot 
generally yield a fast method in cases of severe ill-conditioning. 


A systematic way to reduce these effects is through Tikhonov regularization 
[92, 106]. Regularization involves a small level shift of the eigenvalues, ^ s + /i/, 

altering the condition number of the shifted matrix to ^ [106]. 

Y 


Achieving substantial acceleration with severe ill-conditioning may require a large 
level shift however, producing inverse factors of little practical use. One approach to 
recover a more accurate inverse factor is Riley’s method based on Taylor’s series [106]; 


= sy/2 


^s- 
2 ^ 


1 



( 6 . 1 ) 


For severely ill-conditioned problems and large level shifts, this expansion may converge 
very slowly. Also, adding powers of the full inverse may not be computationally effective. 
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Fig. 5.5. Culled volumes in the thin sliee, dual instanee approximation of Stqi{q for the k{s) = 
10 ^^ nanotube sequence (line width increasing with system size). The slice is ixq = .1,to = .1 Sz Ts = 

.001 thin. Shown arevy^ = x 100%/n^ (blue) andv^^ = x 

100%/n^ (red). Also shown is the trace error, = {n — trxk) /n (black), which rapidly approaches 
10“^^ (not shown). 


6.1. Product representation. We introduce an alternative representation of 
the regularized inverse factor; 


S , (6.2) 

r = To 
p=Po 

which is a telescoping product of preconditioned “slices” starting with a most- 
approximate-yet-still-effective-by-one-order (MAYEBOO) preconditioner, Srogo = 

I To /io ; Braket notation marks the potential for asymmetries in the intermedi¬ 

ate representation. This sandwich of generic, thinly sliced SpAMM products allows to 
construct a nested scoping on precision via r, and in the effective condition number 
controlled by fi. 

6.2. Effective by one order. We look again at the k { s ) = 10^^ nanotube series 
described in Section 4.3, this time with extreme regularization, /io = -1, and at a finer 
granularity, W = 8. Culled and zp volumes (as percentage of the total work) for 
36 - 128 X the (3,3) unit cell are shown for the MAYEBOO approximation in Fig. 5.4 
for the single instance, and in Fig. 5.5 for the dual instance. 

The behavior of these implementations is very different; in the single instance, a 
stable iteration could not be found at precision tq = .1. Stability could only be found 
at .01, and that with a poorly contained trace error and cull-volumes that continue to 
inflate past convergence, with a conspicuous Vk-like dependence. This behavior results 
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from the broad resolution of spectral powers Sro^o with corresponding 

large metric fields that are poorly bound by Eq. (2.3). 

On the other hand, dual iteration volumes collapse rapidly with fast trapping of 

the trace error, as Irot^o^^o^Ulo and Zk Sto^iIo^toItom^o^ and with Eq. (2.3) 

tightening to 

< Tn||s^/2j| 

and 

aAmo'-^-o'^o <rn||sy/^^||. (6.4) 

This contraction to the plane diagonal is compressive, leading to computational 
complexities that should approach quadtree copy in place. 



Fig. 6.1. Product volumes in construction of the unregularized preeonditioner |ro = .001; 
with dual instanee square root iteration, and for the 6-311G** metrie of 100 periodie water moleeules. 
At top its = h(x[x]^^i\<^^y^_i for k = 0,5,Szl7, while on the bottom we have = yk^r^k for 
= 0, 5, &17. Maroon is a, purple is b, green is c, and black is the volume vola^^b the product 

C = CL<S)rh- 


6.3. Iterative regularization. We now sketch an iterative approach to con¬ 
structing the product representation, Eq. (6.2). In the dual instance, it is possible 
to obtain a first MAYEBOO approximation which improves the con¬ 

dition number by one order of magnitude, with a numerical resolution of approxi¬ 
mately one digit. Then, a next level slice can be found, based on the residual 

(^1 (s +/ill) ^iSrogo, with e.g. /ii = .01 and ri = .01. The product 

Srogi <^iSto^q then improves the condition number by two orders of magnitude, still 
with a numerical resolution of one digit. Reflected in the preceding notation, it appears 
necessary to compute the residual at a higher level of precision, e.g. using instead 
of (gVo arid with Ti > Tq. 
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In this way, it may be possible to obtain product representation of the inverse 
square root at a SpAMM resolution that is potentially far more permissive than otherwise 
possible, 


*ro 


ri ^To^n-l 


5-1/2 

rofio ' 


(6.5) 


assuming .1 > /io > /^i > • • • Likewise, it may also be possible to obtain the full 
inverse factor with increasing numerical resolution as 




'Tm + 1 ^Tm-l ' 


.- 1/2 


( 6 . 6 ) 


and . 1 > To > Ti > ... 

Also, with each step a well conditioned generic slice, it may be possible to find a 
more effective logistic map optimized for a vanilla distribution of eigenvalues. Finally, 
relative to the regularization and precision scoping sketched here, alternative products 
are possible that may be far more efficient. We hope to pursue these efforts in future 
work. 

7. Locality. 

7.1. Spatial and metric locality. Astrophysical n-body algorithms employ 
range queries over spatial databases to hierarchically discover and compute approx¬ 
imations that commit only small errors. Often, these spatial databases are ordered 
with a space filling curve (SFC) [125, 1, 101, 124, 126, 84, 102, 2, 19, 7], which maps 
points that are close in space to an index where they are also close. Spatial locality of 
this type empowers the SpAMM approximation through Block-By-Magnitude orderings 
of the sub-space metric. 

This metric locality is compressive, but diminished by dimensionality. In Figure 
6.1, we show <S)r volumes for square root iteration, corresponding to the Gramian 
matrix of a small, periodic water box with the large 6-31IG** basis (Section 4.3). In 
this 3-d periodic case, diminishing Cartesian separations lead to long-skinny pillae 
and related delocalizations not observed in lower dimensional problems at this modest 
^ 10^ level of ill-conditioning. These delocalizations correspond to weakness in 
Eq. (2.3), and to tighter values of r^, required in the MAYSS approximation. As n 
becomes large, Cartesian separation will eventually thin these delocalizations leading 
to complexity reduction due only to metric decay. 

7.2. Algebraic locality. In addition to compression through orderings that 
maximize these block-by-magnitude effects, we demonstrate a new kind of locality 
in Figs. 7.1 and 7.2, which is, so far, uniquely exploited by the n-body approach to 
square root iteration. This locality increases compressively towards convergence, as 
contractive identity iterations develop. We call this compression lensing^ involving 
collapse of the culled volume about plane diagonals of the identity. Lensing corresponds 
to strengthening Eq.(2.3), viz Eqs. (6.3)-(6.4), and to strongly contracting directional 
derivatives, viz Eqs. (5.7)-(5.8). This is an important, mitigating effect for SpAMM 
computations in the y channel, encumbered by the parameter Tq ^ .01 x r. 

Graph reorderings that minimize the distance of matrix elements from the diagonal 
also lead to matrix locality (aforementioned). In Fig. 7.3 we show convergence of an 
unregularized (MAYSS) preconditioner for this type of ordering and the bcsstkl4 
[123] structural matrix of the Omni Coliseum in Atlanta, with = 10^^. These 
results show remarkable gossamer sheeting and flattening along plane diagonals, in 
Fig. 7.3, at top for development of as well as hollow accumulation of 
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Fig. 7.1. Product volumes in construction of the MAYEBOO preconditioner 

|to = .1,^40 = -1; with dual instance square root iteration, for 8x n(s) = 10^^ nano-tuhe. yj^ 

appears wider than zj^ because it is computed at a higher precision, Ts = .001, and because the first 
multiply involves . At top its = ha.[xk-i\^syk-i for k = 0,4,&16, while on the bottom we 
have Xk = yk^r^k for k = 0,2, &16. Maroon is a, purple is b, green is c, and black is the volume 
vola(g)^6 in the product c = 


at bottom. Interestingly, this example demonstrates well lensed volumes towards 
convergence, whilst the equally ill-conditioned and lower dimensional k { s ) = 10 ^^ 
nanotube demands a much tighter value of (10“^ vs. 10“^) and retains dense 
volumes through 128 x the unit cell. 

7.3. Complexity reduction. Finally, we show complexity reduction at conver¬ 
gence of the MAYEBOO approximation relative to the MAYSS approximation, in 
Fig. 7.4 for periodic water boxes, and in Fig. 7.5 for the ill-conditioned nano-tube. The 
two-orders difference between and volumes corresponds precisely to ^ r x .01, 
with Xk in between. Except for the slower trend in Fig. (7.4)’s Xk volume, we see the 
potential for continued strong acceleration with increasing system size. 

8. Conclusions. In this work, we developed the SpAMM n-body solver for square 
root iteration, along with some algebra for the operator and showed how we could 
exploit different types of locality in the sub-space metric of the product-tensor. Our 
main contributions include a modified Cauchy-Schwarz criterion for the SpAMM occlusion- 
cull, Eq. (2.2), and proof that the corresponding relative error in the product is bound 
by Eq. (2.3). We showed how block-by-magnitude orderings and locality of the sub¬ 
space metric leads to reduced complexity of the SpAMM kernel, involving low-dimensional 
sub-structures that bound the relative error, distributed along plane-diagonals and 
along their their intersection at the cube-diagonal. Perhaps most significantly, we 
demonstrated a new kind of compressive locality, leasing, that develops in the (g)^- 
volume on contractive identity iteration, together with tightening the SpAMM bound, 
viz Eqs. (6.3)-(6.4). 
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Fig. 7.2. Product volumes in construction of the MAYEBOO preconditioner 
|to = -1,^0 = with dual instance square root iteration, for 6-3110*"^ box of 100 peri¬ 
odic water molecules. At top its = /ia for k = 0,4, &15^ while on the bottom we 

have Xk = Vk^r^k for k = 0,4, &15. Maroon is a, purple is b, green is c, and black is the volume 
vola(g)^6 in the product c = 


Additional contributions include development and implementation of a first order 
Frechet analyses for the single and dual instances of the NS square root iteration, with 
focus on separating directional effects that are mostly controlled by the unperturbed 
reference algebra, from the magnitude of SpAMM occlusion errors and their accumulation. 
We found that numerical sensitivity develops primarily in the 2 : channel, according to 
Eq. (5.12), due to amplification of 6y by terms approaching condition of the full inverse; 
we then looked at sensitivity to this error, bifurcations, controled by (Figs. 5.1-5.3), 
concluding that a most approximate, naive application of SpAMM to the ill-conditioned 
problem is generally insufficient to achieve a fast solution. 

Finally, we introduced scoping on both precision and regularization in product 
representation of the inverse factor, and demonstrated the potential for orders of 
magnitude compression in the dual instance. Figs. 7.4-7.5, with the most extreme, 
“by-one-order” slice of the nested factor, providing a foothold for this expansion at 
To = .1. A next step is to demonstrate full bootstrapping of the inverse factor with 
reduced complexity, i.e. via a compact, nested product of well lensed terms, a work in 
progress. 
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Fig. 7.4. Complexity reduction in metric square root iteration for the periodic 6-3110'^'^ water 
sequence. Shown is the ratio of lensed product volumes for the regularized MAYEBOO approximation 
with respect to the unregularized (MAYSS) approximation. 



Fig. 7.5. Complexity reduction in square root iteration for the k,(s) = 10^10) sequence. Shown 
is the ratio of lensed product volumes for the regularized MAYEBOO approximation with respect to 
the unregularized MAYSS approximation, which we take to he . 






